HMM-based motion trajectory generation for speech animation synthesis
Authors
Abstract
Synthesis of realistic facial animation for arbitrary speech is an important but difficult problem. The difficulties lie in the synchronization between lip motion and speech, articulation variation under different phonetic contexts, and expression variation across speaking styles. To address these problems, we propose a visual speech synthesis system based on a five-state, multi-stream HMM, which generates synchronized motion trajectories for the given text and speech input. Since the motion and the speech are modeled as distinct but coherent streams, synchronization at each state is guaranteed. By considering phonetic context and supra-segmental information, context-dependent phone models are constructed and clustered with classification and regression trees, which capture the variable phonetic context and speaking style. The experimental results show that the HMM-based method can generate realistic lip animation while preserving detailed articulation and transitions. Moreover, it is capable of presenting articulation variation under different phonetic contexts and expressing various speaking styles, such as emphasized speech.

1. System Introduction

Synthesizing realistic, human-like speech animation is a challenging research topic in both the speech and animation communities. Manual approaches, which typically involve selecting or creating key-frames as a basis for generating continuous and natural animation, are painstaking and tedious even for a skilled animator. Facial motion capture, on the other hand, is widely used in the entertainment industry and can acquire high-fidelity motion data, but it has two main problems: (1) the cost in time and equipment, and (2) all needed motion must be recorded beforehand. Automatic animation synthesis rendered from pre-recorded motion capture data is therefore more desirable. In this work, we propose a novel data-driven speech animation synthesis method.
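As a concrete illustration of the multi-stream idea, the sketch below scores one frame of paired speech and motion observations against a single shared HMM state, so the two streams are forced to stay frame-synchronous. The per-stream diagonal-Gaussian layout and the stream weights are illustrative assumptions, not the paper's exact model structure:

```python
import numpy as np

def multistream_loglik(obs_speech, obs_motion, state, weights=(1.0, 1.0)):
    """Joint log-likelihood of one frame under a multi-stream HMM state.

    Each stream has its own diagonal Gaussian ('mean'/'var' arrays in the
    hypothetical `state` dict), but both are tied to the same state, which
    is what keeps speech and motion synchronized at the state level.
    """
    def gauss_ll(x, mean, var):
        # Diagonal-Gaussian log-density, summed over dimensions.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    ll_speech = gauss_ll(obs_speech, state["speech"]["mean"], state["speech"]["var"])
    ll_motion = gauss_ll(obs_motion, state["motion"]["mean"], state["motion"]["var"])
    # Stream weights let training/decoding balance the two modalities.
    return weights[0] * ll_speech + weights[1] * ll_motion
```

Because the combined score is a weighted sum of per-stream log-likelihoods at the same state, Viterbi alignment against speech automatically yields state boundaries for the motion stream as well.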
The idea is analogous to HMM-based speech synthesis, which forms utterances by predicting the most likely speech parameters from statistically trained HMMs. Given speech and text input, the proposed system generates (synthesizes) the most likely motion trajectories of both the head and critical markers on the face. The synthesized motion trajectories are transformed into control parameters that drive a lively 2D/3D cartoon head. As shown in Fig. 1, the proposed system consists of four steps: (1) data collection; (2) model training; (3) motion trajectory generation; and (4) animation retargeting.

In data collection, a motion capture system records abundant facial-marker motion trajectories along with simultaneous audio (speech) and video. The recordings cover rich phonetic (speech) contexts, different speaking styles, lively emotions, and natural facial expressions.

In model training, HMMs are trained to model the captured motion trajectories statistically in the maximum-likelihood sense. Since motion and speech are modeled as distinct but coherent streams in the HMM, synchronization between motion and speech at each phoneme state is imposed automatically. By considering phonetic context and supra-segmental information, animation models for each context-dependent phone are constructed and clustered into a classification and regression tree to characterize coarticulatory variation in different speaking styles.

In motion trajectory synthesis, the statistically trained HMMs are used to generate (predict) the most likely motion trajectories, given the acoustic and prosodic features of the speech.

Finally, a 2D/3D talking head is rendered by transforming the marker motion trajectories into head and facial control parameters for a lively animation sequence. Using animation retargeting techniques, the system can drive any reasonable facial mesh.
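The trajectory-generation step is closely related to the standard HMM parameter generation algorithm, which chooses the static trajectory best fitting the per-frame static and delta-feature Gaussians of the aligned states. A minimal NumPy sketch under that assumption (one dimension, a simple (-0.5, 0, 0.5) delta window; not the authors' exact implementation):

```python
import numpy as np

def generate_trajectory(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood trajectory generation (sketch).

    mu, var: (T, 2) per-frame Gaussian means/variances for the
             [static, delta] features taken from the aligned HMM states.
    Returns the (T,) static trajectory c maximizing the likelihood under
    the constraint delta_c_t ~ (c_{t+1} - c_{t-1}) / 2.
    """
    T = mu.shape[0]
    # W maps the static trajectory c (T,) to stacked [static; delta] (2T,).
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row: identity
        for k, w in zip((-1, 0, 1), delta_win):  # delta row: window
            if 0 <= t + k < T:
                W[2 * t + 1, t + k] = w
    m = mu.reshape(-1)         # stacked means, shape (2T,)
    p = 1.0 / var.reshape(-1)  # stacked precisions, shape (2T,)
    # Solve the normal equations (W' S^-1 W) c = W' S^-1 m.
    A = W.T @ (p[:, None] * W)
    b = W.T @ (p * m)
    return np.linalg.solve(A, b)
```

The delta constraint is what produces smooth, coarticulated transitions between state targets instead of piecewise-constant output.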
Fig. 1: Flowchart of the speech animation synthesis system.

Objective evaluation was conducted by comparing the synthesized facial motion against the captured motion (i.e., the ground truth). The results show that the proposed method is effective for producing realistic speech animation. Subjective comparison with conventional key-frame-based animation synthesis showed that the HMM-based method generates more natural lip movements and renders realistic coarticulation sequences.

AVSP 2009, International Conference on Audio-Visual Speech Processing, University of East Anglia, Norwich, UK, September 10-13, 2009. ISCA Archive: http://www.isca-speech.org/archive
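The objective evaluation described above amounts to a distance between synthesized and captured marker trajectories; one common choice is a per-marker RMSE, sketched here (the paper's exact metric is not specified in this excerpt):

```python
import numpy as np

def marker_rmse(synth, captured):
    """Per-marker RMSE between synthesized and captured trajectories.

    synth, captured: (T, M, 3) arrays of M 3-D marker positions over
    T frames. Returns an (M,) array of root-mean-square Euclidean
    errors, one value per marker.
    """
    assert synth.shape == captured.shape
    err = np.linalg.norm(synth - captured, axis=-1)  # (T, M) frame errors
    return np.sqrt(np.mean(err ** 2, axis=0))        # (M,) RMSE per marker
```

Reporting the error per marker (rather than one pooled number) makes it easy to see whether errors concentrate on the lips, jaw, or head markers.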
Similar resources
Speech-driven lip motion generation with a trajectory HMM
Automatic speech animation remains a challenging problem that can be described as finding the optimal sequence of animation parameter configurations given some speech. In this paper we present a novel technique to automatically synthesise lip motion trajectories from a speech signal. The developed system predicts lip motion units from the speech signal and generates animation trajectories autom...
Speech-driven Animation using Multi-modal Hidden Markov Models
The main objective of this thesis was the synthesis of speech synchronised motion, in particular head motion. The hypothesis that head motion can be estimated from the speech signal was confirmed. In order to achieve satisfactory results, a motion capture data base was recorded, a definition of head motion in terms of articulation was discovered, a continuous stream mapping procedure was develo...
An introduction of trajectory model into HMM-based speech synthesis
In the synthesis part of a hidden Markov model (HMM) based speech synthesis system which we have proposed, a speech parameter vector sequence is generated from a sentence HMM corresponding to an arbitrarily given text by using a speech parameter generation algorithm. However, there is an inconsistency: although the speech parameter vector sequence is generated under the constraints between stat...
Modulation spectrum-constrained trajectory training algorithm for HMM-based speech synthesis
This paper presents a novel training algorithm for Hidden Markov Model (HMM)-based speech synthesis. One of the biggest issues causing significant quality degradation in synthetic speech is the over-smoothing effect often observed in generated speech parameter trajectories. Recently, we have found that a Modulation Spectrum (MS) of the generated speech parameters is sensitively correlated with ...
A speech parameter generation algorithm using local variance for HMM-based speech synthesis
This paper proposes a parameter generation algorithm using local variance (LV) constraint of spectral parameter trajectory for HMM-based speech synthesis. In the parameter generation process, we take account of both the HMM likelihood of speech feature vectors and a likelihood for LVs. To model LV precisely, we use dynamic features of LV with context-dependent HMMs. The objective experimental r...
Publication year: 2009